Appendix For Recurrent Bayesian Classifier Chains For Exact Multi-Label Classification
As calculating these residuals requires out-of-sample inference, we fit the models on half of the data and evaluate on the other half, then swap the training and testing sets and repeat training and inference. We used the Adam optimizer [4] and PyTorch's exponential learning rate scheduler with gamma set to 0.99.
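The two-way split described above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the names `cross_fit_residuals` and `fit_base_rate` are hypothetical, the base-rate "model" is a stand-in for whatever per-class classifier is actually used, and the sign convention (ground truth minus prediction) is an assumption.

```python
import random
from typing import Callable, List, Sequence

Predictor = Callable[[float], float]


def cross_fit_residuals(
    xs: Sequence[float],
    ys: Sequence[float],
    fit: Callable[[List[float], List[float]], Predictor],
    seed: int = 0,
) -> List[float]:
    """Two-fold cross-fitting: every residual comes from a model
    that was trained on the half of the data not containing that point."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    folds = (idx[:half], idx[half:])

    residuals = [0.0] * n
    # First pass trains on fold 0 and scores fold 1; second pass swaps them.
    for train_idx, test_idx in (folds, folds[::-1]):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            # Assumed convention: residual = ground truth - prediction.
            residuals[i] = ys[i] - predict(xs[i])
    return residuals


def fit_base_rate(train_x: List[float], train_y: List[float]) -> Predictor:
    """Toy stand-in model: always predicts the training-set label frequency."""
    rate = sum(train_y) / len(train_y)
    return lambda x: rate


labels = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
features = [float(i) for i in range(len(labels))]
out_of_sample = cross_fit_residuals(features, labels, fit_base_rate)
```

Because each half is scored only by the model trained on the other half, no residual is computed in-sample, which is the property the paragraph above relies on.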
VeriFastScore: Speeding up long-form factuality evaluation
Rajendhran, Rishanth, Zadeh, Amir, Sarte, Matthew, Li, Chuan, Iyyer, Mohit
Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
Appendix For Recurrent Bayesian Classifier Chains For Exact Multi-Label Classification
For the experiments described in Section 3.5 of the main paper, all methods that required a Bayesian These residuals are obtained by first training a separate classifier for each class, and then calculating the residual as the error between the predicted and ground-truth class.
Training Hyperparameters
For each method, we used a batch size of 128 and a learning rate of 0.001. Each method was trained until convergence for 200 epochs.
To validate that our "non-noisy" class conditioning approach is RBCC, and the class ordering implies that each class is predicted before its parent classes. Results are shown in Figure 1.
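To give a sense of scale for the training hyperparameters, the sketch below computes the learning rate that an exponential schedule would produce over 200 epochs. It assumes that 0.001 is the initial rate, that the gamma of 0.99 mentioned elsewhere in this appendix applies to the same runs, and that the scheduler steps once per epoch; none of these is confirmed by the text. The helper `exponential_lr` is hypothetical and reproduces in plain Python the decay rule that PyTorch's `torch.optim.lr_scheduler.ExponentialLR` applies (multiply the rate by gamma at each step).

```python
from typing import List


def exponential_lr(initial_lr: float, gamma: float, epochs: int) -> List[float]:
    """Learning rate used during each epoch when the rate is
    multiplied by `gamma` after every epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


# Assumed pairing of settings: lr 0.001, gamma 0.99, 200 epochs.
schedule = exponential_lr(initial_lr=0.001, gamma=0.99, epochs=200)

first_lr = schedule[0]    # rate during the first epoch
final_lr = schedule[-1]   # rate during the last epoch
decay_ratio = final_lr / first_lr
```

Under these assumptions the rate shrinks by a factor of 0.99^199 between the first and last epoch, so the final epochs run at roughly an order of magnitude below the starting rate.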